Our project

We decided to do central Colombia, basically because it is where the capital is.

We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.

We decided to consider as central Colombia the following departments/districts: Bogotà DC, Boyacá, Tolima, Cundinamarca, Meta, Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá, Antioquia, Santander, Casanare.

Description of variables

ID de caso: ID of the confirmed case.

Fecha de diagnóstico: Date in which the disease was diagnosed.

Ciudad de ubicación: City where the case was diagnosed.

Departamento o Distrito: Department or district where the city belongs to.

Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.

Edad: Age of the confirmed case.

Sexo: Sex of the confirmed case.

Tipo: How the person got infected: in Colombia, abroad or unknown.

País de procedencia: Country of origin if the person got infected abroad.

Map

Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.

Preprocessing

We had to clean the dataset:

  • We transformed the Fecha de diagnóstico variable into a Date type variable,

  • we fixed the variable Id de caso (some rows were missing so the numbers weren’t consecutive),

  • we created a variable Grupo de edad,

  • we cleaned the column País de procedencia (replaced cities with the country) and created the variable Continente de procedencia (as the first is too fragmented we thought to consider the continents).

##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           2020-03-06              Bogotá             Bogotá D.C.
## 2          2           2020-03-09                Buga         Valle del Cauca
## 3          3           2020-03-09            Medellín               Antioquia
##     Atención Edad Sexo      Tipo País de procedencia Grupo de edad
## 1 Recuperado   19    F Importado              Italia         19_30
## 2 Recuperado   34    M Importado              España         31_45
## 3 Recuperado   50    F Importado              España         46_60
##   Continente de procedencia
## 1                    Europa
## 2                    Europa
## 3                    Europa

New dataset I

##          Date Elapsed time New cases/day Cumulative cases
## 1  2020-03-06            0             1                1
## 2  2020-03-09            3             2                3
## 3  2020-03-11            5             5                8
## 4  2020-03-12            6             2               10
## 5  2020-03-13            7             3               13
## 6  2020-03-14            8             8               21
## 7  2020-03-15            9            13               34
## 8  2020-03-16           10            12               46
## 9  2020-03-17           11            12               58
## 10 2020-03-18           12            24               82

New dataset II

#million inhabitants per department
cases_dep$people<-rep(0, nrow(cases_dep))

cases_dep[which(cases_dep$`Department ID` == 1),]$people<-6.4
cases_dep[which(cases_dep$`Department ID` == 2),]$people<-7.4
cases_dep[which(cases_dep$`Department ID` == 3),]$people<-1.2
cases_dep[which(cases_dep$`Department ID` == 4),]$people<-0.99
cases_dep[which(cases_dep$`Department ID` == 5),]$people<-0.4
cases_dep[which(cases_dep$`Department ID` == 6),]$people<-1.4
cases_dep[which(cases_dep$`Department ID` == 7),]$people<-2.9
cases_dep[which(cases_dep$`Department ID` == 8),]$people<-1.04
cases_dep[which(cases_dep$`Department ID` == 9),]$people<-0.53
cases_dep[which(cases_dep$`Department ID` == 10),]$people<-0.94
cases_dep[which(cases_dep$`Department ID` == 11),]$people<-2.18
cases_dep[which(cases_dep$`Department ID` == 12),]$people<-1.33
cases_dep[which(cases_dep$`Department ID` == 13),]$people<-4.4

#km^2
antioquia_surface<-63600
bogota_surface<-1775
boyaca_surface<-23189
caldas_surface<-7888
casanare_surface<-44640
cauca_surface<-29308
cundinamarca_surface<-24210
meta_surface<-85635
quindio_surface<-1845
risaralda_surface<-4140
santander_surface<-30537
tolima_surface<-23562
valle_cauca_surface<-22195

#population density inhabitants/km^2
antioquia_density<-88.06
bogota_density<-4552
boyaca_density<-93
caldas_density<-130
casanare_density<-9.4
cauca_density<-50
cundinamarca_density<-99.15
meta_density<-density<-12
quindio_density<-290
risaralda_density<-59.16
santander_density<-72
tolima_density<-56
valle_cauca_density<-183.04

cases_dep$surface<-rep(0, nrow(cases_dep))
cases_dep$density<-rep(0, nrow(cases_dep))

cases_dep[which(cases_dep$`Department ID` ==1),]$surface<-antioquia_surface
cases_dep[which(cases_dep$`Department ID` ==2),]$surface<-bogota_surface
cases_dep[which(cases_dep$`Department ID` ==3),]$surface<-boyaca_surface
cases_dep[which(cases_dep$`Department ID` ==4),]$surface<-caldas_surface
cases_dep[which(cases_dep$`Department ID` ==5),]$surface<-casanare_surface
cases_dep[which(cases_dep$`Department ID` ==6),]$surface<-cauca_surface
cases_dep[which(cases_dep$`Department ID` ==7),]$surface<-cundinamarca_surface
cases_dep[which(cases_dep$`Department ID` ==8),]$surface<-meta_surface
cases_dep[which(cases_dep$`Department ID` ==9),]$surface<-quindio_surface
cases_dep[which(cases_dep$`Department ID` ==10),]$surface<-risaralda_surface
cases_dep[which(cases_dep$`Department ID` ==11),]$surface<-santander_surface
cases_dep[which(cases_dep$`Department ID` ==12),]$surface<-tolima_surface
cases_dep[which(cases_dep$`Department ID` ==13),]$surface<-valle_cauca_surface

cases_dep[which(cases_dep$`Department ID` == 1),]$density<-antioquia_density
cases_dep[which(cases_dep$`Department ID` ==2),]$density<-bogota_density
cases_dep[which(cases_dep$`Department ID` ==3),]$density<-boyaca_density
cases_dep[which(cases_dep$`Department ID` ==4),]$density<-caldas_density
cases_dep[which(cases_dep$`Department ID` ==5),]$density<-casanare_density
cases_dep[which(cases_dep$`Department ID` ==6),]$density<-cauca_density
cases_dep[which(cases_dep$`Department ID` ==7),]$density<-cundinamarca_density
cases_dep[which(cases_dep$`Department ID` ==8),]$density<-meta_density
cases_dep[which(cases_dep$`Department ID` ==9),]$density<-quindio_density
cases_dep[which(cases_dep$`Department ID` ==10),]$density<-risaralda_density
cases_dep[which(cases_dep$`Department ID` ==11),]$density<-santander_density
cases_dep[which(cases_dep$`Department ID` ==12),]$density<-tolima_density
cases_dep[which(cases_dep$`Department ID` ==13),]$density<-valle_cauca_density
##          Date Elapsed time Department Department ID New cases/day
## 1  2020-03-09            3  Antioquia             1             1
## 2  2020-03-11            5  Antioquia             1             3
## 3  2020-03-14            8  Antioquia             1             3
## 4  2020-03-15            9  Antioquia             1             1
## 5  2020-03-19           13  Antioquia             1             3
## 6  2020-03-20           14  Antioquia             1            11
## 7  2020-03-21           15  Antioquia             1             3
## 8  2020-03-22           16  Antioquia             1             5
## 9  2020-03-23           17  Antioquia             1            22
## 10 2020-03-25           19  Antioquia             1             8
##    Cumulative cases/Department Mean age people surface density
## 1                            1 50.00000    6.4   63600   88.06
## 2                            4 35.66667    6.4   63600   88.06
## 3                            7 30.00000    6.4   63600   88.06
## 4                            8 55.00000    6.4   63600   88.06
## 5                           11 52.33333    6.4   63600   88.06
## 6                           22 39.81818    6.4   63600   88.06
## 7                           25 31.00000    6.4   63600   88.06
## 8                           30 45.40000    6.4   63600   88.06
## 9                           52 36.45455    6.4   63600   88.06
## 10                          60 29.75000    6.4   63600   88.06
##          Date Elapsed time   Department Department ID New cases/day
## 1  2020-03-09            3    Antioquia             1             1
## 2  2020-03-11            5    Antioquia             1             3
## 16 2020-04-02           27    Antioquia             1            20
## 17 2020-03-06            0  Bogotá D.C.             2             1
## 18 2020-03-11            5  Bogotá D.C.             2             2
## 40 2020-04-02           27  Bogotá D.C.             2            70
## 41 2020-03-15            9 Cundinamarca             7             1
## 42 2020-03-16           10 Cundinamarca             7             1
## 54 2020-04-01           26 Cundinamarca             7             4
## 55 2020-03-15            9    Risaralda            10             1
##    Cumulative cases/Department Mean age people surface density
## 1                            1 50.00000   6.40   63600   88.06
## 2                            4 35.66667   6.40   63600   88.06
## 16                         127 44.80000   6.40   63600   88.06
## 17                           1 19.00000   7.40    1775 4552.00
## 18                           3 25.00000   7.40    1775 4552.00
## 40                         542 43.18571   7.40    1775 4552.00
## 41                           1 29.00000   2.90   24210   99.15
## 42                           2 24.00000   2.90   24210   99.15
## 54                          42 30.75000   2.90   24210   99.15
## 55                           1 22.00000   0.94    4140   59.16

Exploring the dataset


Other plots



Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).



Age-Sex plot

Tipo

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

## # A tibble: 3 x 3
##   Tipo        `Total number` Percentage
##   <fct>                <int> <chr>     
## 1 Relacionado            291 29.3%     
## 2 Importado              467 47.0%     
## 3 En estudio             235 23.7%

Continent

Now let’s plot a pie chart to be able to see the distribution of cases across the continents.


The majority of the cases in the country are people that got infected inside Colombia. Then, people that contracted the disease abroad came mainly from Europe, followed by North America and Central America.

The frequentist approach

Angela’s attempt

## 
## Call:
## glm(formula = `Cumulative cases/Department` ~ `Elapsed time`, 
##     family = poisson, data = cases_relev_dep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.9994   -4.7902   -2.1236    0.8586   16.3553  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.982659   0.062709   15.67   <2e-16 ***
## `Elapsed time` 0.167306   0.002779   60.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 8732.6  on 82  degrees of freedom
## Residual deviance: 3855.1  on 81  degrees of freedom
## AIC: 4285.4
## 
## Number of Fisher Scoring iterations: 5

New

## [1] "AIC: 317.731647373115"
## [1] "Null deviance:  870.39"    "Residual deviance: 191.78"

ANOVA to compare the Poisson models

## Analysis of Deviance Table
## 
## Model 1: `Cumulative cases` ~ `Elapsed time`
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        23    280.624                          
## 2        18    200.747  5   79.878 8.901e-16 ***
## 3         6      3.303 12  197.444 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Negative Binomial with Elapsed time, Age and Departments as pedictors


## [1] "Estimated overdispersion 3.85395829427517"
## [1] "Null deviance:  7858.91" "Residual deviance: 3.3"

Applying ANOVA to compare the negative binomial models

We decided to compare nb1, nb2, nb5, because they are nested and we are more interested in seeing if the fifth model is in fact better than the first model.

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Cumulative cases
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Elapsed time
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                     `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n    `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
##          theta Resid. df    2 x log-lik.   Test    df  LR stat.      Pr(Chi)
## 1 1.125073e+01        23       -253.9080                                    
## 2 1.257832e+01        18       -251.9442 1 vs 2     5  1.963751 8.541378e-01
## 3 2.453680e+06         6       -165.5091 2 vs 3    12 86.435145 2.409184e-13

Predictive accuracy of the Poisson model

Predicting with a \(95\%\) confidence interval

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

The Bayesian approach

Poisson regression

As a first attempt, we fit a simple Poisson regression:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

with \(i = 1,\dots,134\), being \(134\) the number of rows of our dataset, and \(y_i\) represents the number of cases.

For what concerns the stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day and the function poisson_log_lpmf to specify the likelihood.

Posterior predictive check

poisson_posterior-1

The fit is not satisfactory, it is probably due to overdispersion, we can check the residuals to confirm this hypothesis.

Residual check

first_residual-1

The variance of the residuals increases as the predicted value increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicates approximate \(95\%\) error bounds).

The plot of the standardized residuals indicates a large amount of overdispersion.

Classically the problem of having overdispersed data is solved using the negative binomial model instead of the Poisson’s one.

Negative Binomial model

We try to improve the previous model using the Negative Binomial model:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

Where the parameter \(\phi\) is called precision and it is such that:

\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]

again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\) the negative binomial approaches the Poisson distribution.

The stan function that we use here are neg_binomial_2_log_rng to specify the distribution of \(y_i\) and the function neg_binomial_2_log_lpmf for the likelihood.

Accuracy across departments

NB_deps

We should take into account the differences across departments.

Multilevel Negative Binomial regression

We try to fit the following model, which also includes Age as covariat:

\[ ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]

Hierarchical model

In order to improve the fit, we fit a model with department-specific intercept term.

So the varying intercept model that we take into account is now:

\[ ln(\lambda_{i,d}) = \alpha_d + + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathcal{Negative Binomial}(\lambda_{i,d}, \phi) \]

The priors used for the above model are the following:

\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]

being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).

New dataset

We added the following covariats into the dataset:

  • people: millions of inhabitants for each region;

  • surface: \(km^3\), extent of each region;

  • density: \(\frac{people}{km^2}\), density of the population in each region.

Accuracy across departments

hier_deps

We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.

LOOIC

The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.

Plot the looic to compare models:

looic